Support Location Providers #1452
LocationProvider
module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
module = importlib.import_module(module_name)
class_ = getattr(module, class_name)
Hmm, wonder if we should reduce duplication between this and file IO loading.
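One way the duplication mentioned above could be reduced is a small shared helper that resolves a dotted path to a class, which both the FileIO loader and the location-provider loader could call. This is only a sketch: `load_class` is a hypothetical name, not an existing PyIceberg function, and the error handling is an assumption.

```python
import importlib
from typing import Any


def load_class(impl: str) -> Any:
    """Resolve a fully qualified 'package.module.ClassName' string to the class object.

    Hypothetical shared helper; not part of PyIceberg.
    """
    path_parts = impl.split(".")
    if len(path_parts) < 2:
        raise ValueError(f"Expected a fully qualified 'module.ClassName' path, got: {impl}")
    # Everything before the last dot is the module; the last part is the class name.
    module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Usage: resolve a standard-library class by its dotted path.
ordered_dict_cls = load_class("collections.OrderedDict")
```

Each loader would then only need to instantiate the returned class with its own constructor arguments.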
@@ -2622,13 +2631,15 @@ def _dataframe_to_data_files(
        property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
        default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
    )
    location_provider = load_location_provider(table_location=table_metadata.location, table_properties=table_metadata.properties)
Don't love this. I wanted to do something like this and cache on at least the Transaction (which this method is exclusively invoked by), but the problem I think is that properties can change on the Transaction, potentially changing the location provider to be used. I suppose we can update that provider on a property change (or maybe any metadata change), but I'm unsure if this complexity is even worth it.
That's an interesting edge case. It seems like an anti-pattern to change the table property and write in the same transaction, although it's currently allowed.
from pyiceberg.utils.properties import property_as_bool


class DefaultLocationProvider(LocationProvider):
The biggest difference vs the Java implementations is that I've not supported write.data.path here. I think it's natural for write.metadata.path to be supported alongside this, so this would be a larger and arguably location-provider-independent change? Can look into it as a follow-up.
Thanks! It would be great to have write.data.path and write.metadata.path.
@@ -192,6 +195,14 @@ class TableProperties:
    WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
    WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0

    WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl"
Though the docs say that the default is null, having a constant for this being None felt unnecessary.
        return (
            f"{prefix}/{hashed_path}/{data_file_name}"
            if self._include_partition_paths
            else f"{prefix}/{hashed_path}-{data_file_name}"
Interesting that disabling include_partition_paths affects paths of non-partitioned data files. I've matched Java behaviour here but it does feel odd.
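To make the oddity concrete, here is a minimal sketch of the two path shapes produced by the conditional in the hunk above for a non-partitioned file. The function name, the example prefix, and the example hash string are illustrative assumptions; only the two f-string formats come from the diff.

```python
def unpartitioned_data_path(prefix: str, hashed_path: str, data_file_name: str, include_partition_paths: bool) -> str:
    """Mirror the conditional from the diff for a file with no partition key (hypothetical helper)."""
    return (
        f"{prefix}/{hashed_path}/{data_file_name}"
        if include_partition_paths
        else f"{prefix}/{hashed_path}-{data_file_name}"
    )


# Illustrative values only; real prefixes and hashes will differ.
prefix = "s3://bucket/table/data"
hashed_path = "0101/0110/1110/10101010"

with_flag = unpartitioned_data_path(prefix, hashed_path, "a.parquet", include_partition_paths=True)
without_flag = unpartitioned_data_path(prefix, hashed_path, "a.parquet", include_partition_paths=False)
```

With the flag enabled the file name is its own path segment; with it disabled the final hash component and the file name are joined by a hyphen, even though no partition is involved.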
            TableProperties.WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT,
        )

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
Tried to make this as consistent as possible with its Java counterpart so file locations are consistent too. This means hashing on both the partition key and the data file name below, and using the same hash function.
Seemed reasonable to port over the object storage stuff in this PR, given that the original issue #861 mentions this.
Since Iceberg is mainly focussed on object-stores, I'm leaning towards making the ObjectStorageLocationProvider the default. Java is a great source of inspiration, but it also holds a lot of historical decisions that are not easy to change, so we should reconsider this at PyIceberg.
tests/table/test_locations.py
Outdated
    # Field name is not encoded but partition value is - this differs from the Java implementation
    # https://github.com/apache/iceberg/blob/cdf748e8e5537f13d861aa4c617a51f3e11dc97c/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L304
    assert partition_segment == "part#field=example%23val"
Put up #1457 - I'll remove this special-character testing (that the Java test counterpart does) here because it'll be tested in that PR.
        return f"custom_location_provider/{data_file_name}"


def test_default_location_provider() -> None:
The tests in this file are inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java.
The hash functions are the same so those constants are unchanged.
@Fokko, think this is ready for review now! I've implemented this for write codepaths - |
@@ -1627,6 +1632,67 @@ class AddFileTask:
    partition_field_value: Record


class LocationProvider(ABC):
I would also expect this one to be in location.py? The table/__init__.py is already pretty big.
Thanks for the PR! Generally LGTM; I left a few nit comments.
This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise the location provider will only work in one of the implementations and might error in the other.
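The conflict can be illustrated with a short sketch: a value written for a Java reader is a Java class name, which Python's importlib cannot resolve. The class path `org.example.iceberg.CustomLocationProvider` and the helper `try_load` are hypothetical and exist only for this demonstration; returning None on failure is an assumption, not the PR's behaviour.

```python
import importlib
from typing import Any, Optional


def try_load(impl: str) -> Optional[Any]:
    """Attempt to resolve a dotted class path; return None if it cannot be imported (hypothetical helper)."""
    module_name, _, class_name = impl.rpartition(".")
    try:
        return getattr(importlib.import_module(module_name), class_name)
    except (ModuleNotFoundError, AttributeError):
        return None


# A Java-style class name stored in the shared table property (illustrative value).
java_style = try_load("org.example.iceberg.CustomLocationProvider")

# A path that is importable from Python resolves normally.
python_style = try_load("collections.OrderedDict")
```

The same table property therefore cannot hold a value that both runtimes can load, which is the motivation for a Python-specific property raised above.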
HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3
nit: move these into ObjectStoreLocationProvider
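For readers unfamiliar with these constants, the sketch below shows one plausible reading of how they could turn a file name into entropy directories: take a 20-bit binary hash string, split the first 12 bits into three 4-character directories, and keep the remaining 8 bits as a final segment. The function names are hypothetical, and `zlib.crc32` is a stand-in chosen only so the example runs with the standard library. The PR uses the same hash function as the Java implementation, so real paths will differ from what this produces.

```python
import zlib

HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3


def binary_hash(name: str) -> str:
    """Return a fixed-width binary string of the low 20 hash bits (stand-in hash: CRC32)."""
    top_mask = 1 << HASH_BINARY_STRING_BITS
    # AND keeps the low 20 bits; OR sets bit 20 so bin() does not strip leading zeroes.
    hash_code = (zlib.crc32(name.encode("utf-8")) & (top_mask - 1)) | top_mask
    return bin(hash_code)[-HASH_BINARY_STRING_BITS:]


def dirs_from_hash(file_hash: str) -> str:
    """Split the hash into ENTROPY_DIR_DEPTH directories of ENTROPY_DIR_LENGTH characters, plus the remainder."""
    total = ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH
    parts = [file_hash[i : i + ENTROPY_DIR_LENGTH] for i in range(0, total, ENTROPY_DIR_LENGTH)]
    if len(file_hash) > total:
        parts.append(file_hash[total:])
    return "/".join(parts)


example_dirs = dirs_from_hash("01010110111010101010")
hashed = dirs_from_hash(binary_hash("a.parquet"))
```

Keeping the constants next to this logic, as the nit suggests, makes it clear that they only matter to the object-store provider.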
Closes #861.
As the issue suggests, this introduces a LocationProvider interface with default and object-store-optimised implementations (the latter can be enabled via the newly introduced table properties). This is pluggable, just like FileIO. Largely inspired by and consistent with the Java implementation.
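As a rough illustration of the pluggable design described here, the sketch below defines a minimal abstract interface and a custom implementation. The `new_data_location` signature and the `custom_location_provider/` return value follow the diff hunks in this PR; the constructor arguments, the `object` type used for the partition key, and the `SimpleDefaultProvider` layout are simplifying assumptions rather than the PR's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class LocationProvider(ABC):
    """Minimal stand-in for the interface; constructor arguments are assumed."""

    def __init__(self, table_location: str, table_properties: Dict[str, str]) -> None:
        self.table_location = table_location
        self.table_properties = table_properties

    @abstractmethod
    def new_data_location(self, data_file_name: str, partition_key: Optional[object] = None) -> str:
        """Return the full location for a new data file."""


class SimpleDefaultProvider(LocationProvider):
    # Assumed layout for illustration: files go under '<table_location>/data/'.
    def new_data_location(self, data_file_name: str, partition_key: Optional[object] = None) -> str:
        return f"{self.table_location}/data/{data_file_name}"


class CustomLocationProvider(LocationProvider):
    # Same return value as the custom provider in the PR's test file.
    def new_data_location(self, data_file_name: str, partition_key: Optional[object] = None) -> str:
        return f"custom_location_provider/{data_file_name}"


default_path = SimpleDefaultProvider("s3://bucket/table", {}).new_data_location("f.parquet")
custom_path = CustomLocationProvider("s3://bucket/table", {}).new_data_location("f.parquet")
```

A user would point the write.location-provider.impl table property at the dotted path of such a class so that it is loaded dynamically at write time.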